 data practitioner


The Evolution of LLM Adoption in Industry Data Curation Practices

Qian, Crystal, Liu, Michael Xieyang, Reif, Emily, Simon, Grady, Hussein, Nada, Clement, Nathan, Wexler, James, Cai, Carrie J., Terry, Michael, Kahng, Minsuk

arXiv.org Artificial Intelligence

As large language models (LLMs) grow increasingly adept at processing unstructured text data, they offer new opportunities to enhance data curation workflows. This paper explores the evolution of LLM adoption among practitioners at a large technology company, evaluating the impact of LLMs in data curation tasks through participants' perceptions, integration strategies, and reported usage scenarios. Through a series of surveys, interviews, and user studies, we provide a timely snapshot of how organizations are navigating a pivotal moment in LLM evolution. In Q2 2023, we conducted a survey to assess LLM adoption in industry for development tasks (N=84), and facilitated expert interviews to assess evolving data needs (N=10) in Q3 2023. In Q2 2024, we explored practitioners' current and anticipated LLM usage through a user study involving two LLM-based prototypes (N=12). While each study addressed distinct research goals, they revealed a broader narrative about evolving LLM usage in aggregate. We discovered an emerging shift in data understanding from heuristic-first, bottom-up approaches to insights-first, top-down workflows supported by LLMs. Furthermore, to respond to a more complex data landscape, data practitioners now supplement traditional subject-expert-created 'golden datasets' with LLM-generated 'silver' datasets and rigorously validated 'super golden' datasets curated by diverse experts. This research sheds light on the transformative role of LLMs in large-scale analysis of unstructured data and highlights opportunities for further tool development.


The data practitioner for the AI era

MIT Technology Review

Data practitioners are among those whose roles are experiencing the most significant change, as organizations expand their responsibilities. Rather than working in a siloed data team, data engineers are now developing platforms and tools whose design improves data visibility and transparency for employees across the organization, including analytics engineers, data scientists, data analysts, machine learning engineers, and business stakeholders. This report explores, through a series of interviews with expert data practitioners, key shifts in data engineering, the evolving skill set required of data practitioners, options for data infrastructure and tooling to support AI, and data challenges and opportunities emerging in parallel with generative AI. The report's key findings include the following: This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review's editorial staff.


Qrlew: Rewriting SQL into Differentially Private SQL

Grislain, Nicolas, Roussel, Paul, Agathe, Victoria de Sainte

arXiv.org Artificial Intelligence

This paper introduces Qrlew, an open source library that can parse SQL queries into Relations, an intermediate representation that keeps track of rich data types, value ranges, and row ownership, so that they can easily be rewritten into differentially private equivalents and turned back into SQL queries for execution in a variety of standard data stores. With Qrlew, a data practitioner can express their data queries in standard SQL; the data owner can run the rewritten query without any technical integration and with strong privacy guarantees on the output; and the query rewriting can be operated by a privacy expert who must be trusted by the owner, but may belong to a separate organization.
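
To make the role of value ranges and row ownership concrete, here is a minimal Python sketch of a differentially private SUM using the standard Laplace mechanism. It is not Qrlew's API or its rewriting logic: the function names `laplace_noise` and `dp_sum`, the `(owner_id, value)` row layout, and the choice to clip each owner's total are all assumptions made for illustration, and the floating-point noise sampling is not hardened for production use.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_sum(rows, lower, upper, epsilon, rng=None):
    """Hypothetical helper: noisy sum over (owner_id, value) rows.

    Each owner's total contribution is clipped to [lower, upper], so adding
    or removing one owner changes the sum by at most max(|lower|, |upper|).
    """
    rng = rng or random.Random()

    # Row ownership: group values by the entity to be protected.
    per_owner = defaultdict(float)
    for owner, value in rows:
        per_owner[owner] += value

    # Value ranges: clip each owner's total into the known bounds.
    clipped = [min(max(total, lower), upper) for total in per_owner.values()]

    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rows = [("alice", 5.0), ("alice", 20.0), ("bob", 3.0)]
    # The true clipped sum is 13; the printed value varies with the noise.
    print(dp_sum(rows, lower=0.0, upper=10.0, epsilon=1.0))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of accuracy, which is why known bounds on the values matter: tighter ranges mean lower sensitivity and less noise for the same guarantee.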


Breaking The AI Bias: How To Define Fairness To Deliver Fairer Models

#artificialintelligence

At a basic level, AI learns from our history. Unfortunately, much of societal history includes some discrimination and inequality. It's therefore essential that data practitioners consider this in their work, as AI built without acknowledgement of bias will replicate and even exacerbate this discrimination. This is particularly concerning when you consider the influence AI is already exerting over our lives. McKinsey's recent digital trust survey found that fewer than a quarter of executives are actively mitigating the risks posed by AI models (including fairness and bias).


Synthetic Data and the Data-centric Machine Learning Life Cycle

#artificialintelligence

In this series of posts, we'll cover how Gretel's synthetic data platform helps you overcome challenges across the data-centric machine learning life cycle to help you successfully build, deploy, maintain, and realize value from your AI projects. The life cycle outlined below is a common framework or workflow process for building machine learning and AI solutions. It's focused on streamlining the stages necessary to develop machine learning models, deploy them to production, and maintain and monitor them. These steps are a collaborative process, often involving data scientists and DevOps engineers. The process below was inspired by the value chains created by The Sequence, Databricks, Google Cloud, and Microsoft.


Getting a Peek of the Big Data/Cloud Computing Workflow Using AWS

#artificialintelligence

Originally published on Towards AI. Although I've had the chance now to play with these different technologies, I'm still amazed by the convenience, portability, and computing power that Big Data and Cloud Computing technologies offer, both to consumers and businesses.


Top 15 Books to Master Data Strategy - KDnuggets

#artificialintelligence

If you're a data practitioner with your eye on a leadership role, learning Data Management will be an important step toward getting you where you want to go. In this article, we outline 15 books on topics ranging from Data Architecture (highly technical) to Data Literacy (broadly nontechnical) to help you improve your understanding of end-to-end best practices related to data. Summary: I'd be remiss if I didn't begin this list here. This behemoth covers 14 practical topics related to Data Strategy, followed by 3 topics related to implementation. The 14 different knowledge areas are best represented by the Aiken Pyramid, which outlines how these topics build upon each other.


A better way to browse the web for data practitioners

#artificialintelligence

We are trying to imagine the smoothest way to solve this problem and capture the web efficiently, but we need your support and your feedback to make it happen. This is the part where the information comes to you, because we're not always looking for stuff. On your home page, you would be able to access general reading (and watching, and listening) recommendations, based on your interests and latest readings. Tell us what you're working on or where you are stuck, and our AI search agent will find the right content. We are the first search engine dedicated entirely to AI content: if you do not know a concept, we will suggest and define it for you.


6 productivity tips for beginner data scientists

#artificialintelligence

Tips that will fast-track your productivity as a beginner in data science. I remember that when I wanted to learn data science and machine learning, I was curious about the specific things I needed to do to fast-track my progress while I was just starting out. Having passed that stage and gained more experience, I will share some tips from my own experience in data science that will help beginners in their journey. In this article, you will learn ways to improve yourself as an aspiring or beginner data scientist. I will explain six important productivity tips to improve yourself as a beginner, junior, undergraduate, or aspiring data scientist.


Getting the most from your data-driven transformation: 10 key principles

MIT Technology Review

The importance of data to today's businesses can't be overstated. Studies show data-driven companies are 58% more likely to beat revenue goals than non-data-driven companies and 162% more likely to significantly outperform laggards. Data analytics are helping nearly half of all companies make better decisions about everything, from the products they deliver to the markets they target. Data is becoming critical in every industry, whether it's helping farms increase the value of the crops they produce or fundamentally changing the game of basketball. Used optimally, data is nothing less than a critically important asset. Problem is, it's not always easy to put data to work. The Seagate Rethink Data report, with research and analysis by IDC, found that only 32% of the data available to enterprises is ever used and the remaining 68% goes unleveraged.